TITLE OF THE INVENTION 

APPARATUS AND METHOD FOR SPEECH RECOGNITION 

BACKGROUND OF THE INVENTION 

The present invention relates to a speech recognition apparatus 
and a speech recognition method for an adaptation to both noise 
and speaker. 

The main problems in automatic speech recognition lie in the
background noise added to the speech to be recognized, and in the
individual variation caused by the phonetic organs or utterance
habits of an individual speaker.

In order to achieve robust speech recognition capable of coping
with these problems, speech recognition methods called HMM (Hidden
Markov Model) composition, also called the PMC (Parallel Model
Combination) method, have been studied (for example, see pages
553-556 of IEEE ICASSP 1998, "Improved Robustness for Speech
Recognition Under Noisy Conditions Using Correlated Parallel Model
Combination").

At the pre-processing stage before performing a real speech 
recognition, the HMM composition method or the PMC method generates 
noise adaptive acoustic models (noise adaptive acoustic HMMs) as 
noise adaptive composite acoustic models by the composition of 
standard initial acoustic models (initial acoustic HMMs) and noise 
models (speaker's environmental noise HMM) generated from the 
background noise. 

In the actual speech recognition stage, the likelihood of each
noise adaptive acoustic model generated in the pre-processing stage
is evaluated against the feature vector series, which is obtained by
a cepstrum transformation of the uttered speech including the
additive background noise, and the noise adaptive acoustic model
with the maximum likelihood is output as the result of speech
recognition.

Technologies for speaker adaptation have also been studied
widely; for example, the MAP estimation method and the MLLR method,
which renew the mean vector and the covariance matrix of a model,
are known.

A conventional speech recognition, however, has a problem of 
requiring a large amount of processing for performing 
noise-adaptation of all initial acoustic models in order to obtain 
noise adaptive acoustic models (noise adaptive acoustic HMMs) to 
be compared with the feature vector series. 

The large amount of processing required, which is unacceptable
when high processing speed must be maintained, prevents the number
of initial acoustic models from being increased. The resulting
shortage of initial acoustic models obstructs improvement of the
recognition performance. It should be noted that the efficiency of
an environmental noise adaptation technology can be improved by
using a clustering technique. However, it is hard to directly apply
well-known speaker adaptation technologies such as the MLLR method
or the MAP estimation method to this environmental noise adaptation
technology; that is, the coexistence of noise adaptation and speaker
adaptation technologies has remained a problem to be solved.

SUMMARY OF THE INVENTION 

The present invention has been achieved in view of the foregoing
conventional problems. It is thus an object of the present invention
to provide speech recognition apparatus and methods capable of
reducing the amount of processing required for the noise and speaker
adaptation of initial acoustic models.
According to a first aspect of the present invention, there
is provided a speech recognition apparatus for recognizing speech
by comparing composite acoustic models adapted to noise and speaker
with a feature vector series extracted from an uttered speech. The
speech recognition apparatus comprises a storing section for
previously storing each representative acoustic model selected as
a representative of acoustic models belonging to one of groups,
each of the groups being formed beforehand by classifying a large
number of acoustic models on a basis of a similarity, difference
models of each group obtained from the difference between the
acoustic models belonging to one of the groups and the
representative acoustic model of the identical group, and group
information for associating the representative acoustic models with
the difference models for each identical group. The speech
recognition apparatus further comprises a generating section for
generating each noise adaptive representative acoustic model of each
group by noise-adaptation executed to the representative acoustic
model of each group stored in the storing section, and a generating
section for generating each composite acoustic model of each group
by composition of the difference model and the noise adaptive
representative acoustic model using the group information.
Additionally, the speech recognition apparatus comprises a renewal
model generating section for generating noise and speaker adaptive
acoustic models by performing a speaker-adaptation of the composite
acoustic model for each identical group, using the feature vector
series obtained from the uttered speech, and a model renewal section
for replacing the difference models of each group by renewal
difference models which are generated by taking differences between
the noise and speaker adaptive acoustic models and the noise
adaptive representative acoustic models selected via the group
information, thereby performing a speech recognition by comparing
the feature vector series extracted from the uttered speech to be
recognized with the composite acoustic model adapted to noise and
speaker. Moreover, the composite acoustic model adapted to noise and
speaker is generated by composition of the renewal difference model
and the noise adaptive representative acoustic model, which is
generated by a noise-adaptation of the representative acoustic model
of the group including the renewal difference model selected via the
group information.

According to a second aspect of the present invention, there
is provided a speech recognition apparatus for recognizing speech
by comparing composite acoustic models adapted to noise and speaker
with a feature vector series extracted from an uttered speech. The
speech recognition apparatus comprises a storing section for
previously storing each representative acoustic model selected as
a representative of acoustic models belonging to one of groups,
each of the groups being formed beforehand by classifying a large
number of acoustic models on a basis of a similarity, difference
models of each group obtained from the difference between the
acoustic models belonging to one of the groups and the
representative acoustic model of the identical group, and group
information for associating the representative acoustic models with
the difference models for each identical group. The speech
recognition apparatus further comprises a generating section for
generating each noise adaptive representative acoustic model of each
group by noise-adaptation executed to the representative acoustic
model of each group stored in the storing section, and a generating
section for generating each composite acoustic model of each group
by composition of the difference model and the noise adaptive
representative acoustic model using the group information.
Additionally, the speech recognition apparatus comprises a
recognition processing section for recognizing speech by comparing
the composite acoustic models generated in the generating section
for composite acoustic models with the feature vector series
extracted from the uttered speech to be recognized, a renewal model
generating section for generating noise and speaker adaptive
acoustic models by performing a speaker-adaptation of the composite
acoustic model for each identical group, using the feature vector
series obtained from the uttered speech, and a model renewal section
for replacing the difference models of each group by renewal
difference models which are generated by taking differences between
the noise and speaker adaptive acoustic models and the noise
adaptive representative acoustic models selected via the group
information. Thereby, at every repetition of the speech recognition,
the recognition processing section performs a speech recognition by
comparing the feature vector series extracted from the uttered
speech to be recognized with the composite acoustic model adapted to
noise and speaker, which is generated by composition of both the
noise adaptive representative acoustic model, generated by
noise-adaptation of the representative acoustic model of each group
including the renewal difference model selected with the group
information, and the renewal difference model renewed by the renewal
model generating section and the model renewal section.

According to a third aspect of the present invention, there
is provided a speech recognition method for recognizing speech by
comparing a set of composite acoustic models adapted to noise and
speaker with a feature vector series extracted from an uttered
speech. The speech recognition method comprises the step of
previously storing, in a storing section, each representative
acoustic model selected as a representative of acoustic models
belonging to one of groups, each of the groups being formed
beforehand by classifying a large number of acoustic models on a
basis of a similarity, difference models of each group obtained from
the difference between the acoustic models belonging to one of the
groups and the representative acoustic model of the identical group,
and group information for associating the representative acoustic
models with the difference models for each identical group. Further,
the speech recognition method comprises the steps of generating each
noise adaptive representative acoustic model of each group by
noise-adaptation executed to the representative acoustic model of
each group stored in the storing section, and generating each
composite acoustic model of each group by composition of the
difference model and the noise adaptive representative acoustic
model using the group information. Additionally, the speech
recognition method comprises the steps of generating noise and
speaker adaptive acoustic models by performing a speaker-adaptation
of the composite acoustic model for each identical group, using the
feature vector series obtained from the uttered speech, and
replacing the stored difference models of each group by renewal
difference models which are generated by taking differences between
the noise and speaker adaptive acoustic models and the noise
adaptive representative acoustic models selected via the group
information. Under the above-mentioned steps, the speech recognition
is performed by comparing the feature vector series extracted from
the uttered speech to be recognized with the composite acoustic
model adapted to noise and speaker. Moreover, the composite acoustic
model adapted to noise and speaker is generated by composition of
the renewal difference model and the noise adaptive representative
acoustic model, which is generated by a noise-adaptation of the
representative acoustic model of the group including the renewal
difference model selected via the group information.

According to a fourth aspect of the present invention, there
is provided a speech recognition method for recognizing speech by
comparing a set of composite acoustic models adapted to noise and
speaker with a feature vector series extracted from an uttered
speech. The speech recognition method comprises the step of
previously storing, in a storing section, each representative
acoustic model selected as a representative of acoustic models
belonging to one of groups, each of the groups being formed
beforehand by classifying a large number of acoustic models on a
basis of a similarity, difference models of each group obtained from
the difference between the acoustic models belonging to one of the
groups and the representative acoustic model of the identical group,
and group information for associating the representative acoustic
models with the difference models for each identical group. Further,
the speech recognition method comprises the steps of generating each
noise adaptive representative acoustic model of each group by
noise-adaptation executed to the representative acoustic model of
each group stored in the storing section, and generating each
composite acoustic model of each group by composition of the
difference model and the noise adaptive representative acoustic
model using the group information. Additionally, the speech
recognition method comprises the steps of recognizing a speech by
comparing the composite acoustic models generated in the generating
step for composite acoustic models with the feature vector series
extracted from the uttered speech to be recognized, generating noise
and speaker adaptive acoustic models by performing a
speaker-adaptation of the composite acoustic model for each
identical group, using the feature vector series obtained from the
uttered speech, and replacing the stored difference models of each
group by renewal difference models which are generated by taking
differences between the noise and speaker adaptive acoustic models
and the noise adaptive representative acoustic models selected via
the group information. Under the above-mentioned steps, at every
repetition of the speech recognition, the recognition processing
step performs a speech recognition by comparing the feature vector
series extracted from the uttered speech to be recognized with the
composite acoustic model adapted to noise and speaker, which is
generated by composition of both the noise adaptive representative
acoustic model, generated by noise-adaptation of the representative
acoustic model of each group including the renewal difference model
selected with the group information, and the renewal difference
model renewed by the noise and speaker adaptive acoustic models
generating step and the difference models replacing step.

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other objects and advantages of the present invention
will become clearly understood from the following description with
reference to the accompanying drawings, wherein:

Fig. 1 is a block diagram for illustrating a structure of speech
recognition apparatus according to the first embodiment of the
present invention;

Fig. 2 is an explanatory view for illustrating a generation
principle of representative acoustic models and difference models;

Fig. 3 is an explanatory view for illustrating a relationship
among representative acoustic models, difference models and initial
acoustic models;

Fig. 4 is an explanatory view for illustrating a generation
principle of noise adaptive composite acoustic models;

Fig. 5 is an explanatory view for illustrating a generation
principle of noise and speaker adaptive acoustic models for adapting
to both noise and speaker, and a generation principle of a renewal
difference model;

Fig. 6 is a flowchart for illustrating steps before a difference
model is renewed by a renewal difference model;

Fig. 7 is a flowchart for illustrating a behavior in speech
recognition;

Fig. 8 is a block diagram for illustrating a structure of speech
recognition apparatus according to the second embodiment of the
present invention; and

Fig. 9 is a flowchart for illustrating a behavior of speech
recognition apparatus according to the second embodiment of the
present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The preferred embodiments of the present invention will be 
explained referring to the attached drawings. 
(First Embodiment)

The first embodiment of the present invention will be explained
referring to Fig. 1 through Fig. 7. Fig. 1 is a block diagram showing
a structure of a speech recognition apparatus of the present
embodiment.

As shown in Fig. 1, the speech recognition apparatus has a
structure for recognizing speech using HMMs, and comprises a storing
section 1 which previously stores data such as acoustic models, an
uttered speech environment noise model generating section 2, a noise
adaptive representative acoustic model generating section 3, a
composite acoustic model generating section 4, a renewal model
generating section 5, a model renewal section 6, and a recognition
processing section 7.

Furthermore, the speech recognition apparatus employs a switch
10 and a speech analyzing section 9, which generates and outputs
the feature vector series V(n) in the cepstrum domain for every
predetermined frame period using a cepstrum transformation of an
input signal v(t) from a microphone 8.

The storing section 1 stores beforehand many acoustic models
of sub-word units, such as phonemes, generated by learning standard
uttered speech.

Note that the large number of initial acoustic models (obtained
only by learning standard uttered speech) are not stored in their
original form. Instead, representative acoustic models (C) and
difference models (D), obtained by grouping or clustering each
distribution (with mean vector and covariance matrix) of the large
number of initial acoustic models, are stored in a representative
acoustic model storing unit 1a and a difference model storing unit
1b, respectively. More detailed descriptions will be given below.

The large number of initial acoustic models are divided into
groups $G_1$ to $G_X$ by the clustering method mentioned above. Then,
assuming that, for example, the first ($x=1$) group $G_1$ has $q_1$
initial acoustic models $S_{1,1}$ to $S_{1,q_1}$ as its members, one
representative acoustic model $C_1$ and $q_1$ difference models
$d_{1,1}$ to $d_{1,q_1}$ are derived therefrom.

When the second ($x=2$) group $G_2$ has $q_2$ initial acoustic
models $S_{2,1}$ to $S_{2,q_2}$ as its members, one representative
acoustic model $C_2$ and $q_2$ difference models $d_{2,1}$ to
$d_{2,q_2}$ are derived therefrom. In the same manner, when the last
($x=X$) group $G_X$ has $q_X$ initial acoustic models $S_{X,1}$ to
$S_{X,q_X}$, one representative acoustic model $C_X$ and $q_X$
difference models $d_{X,1}$ to $d_{X,q_X}$ are derived therefrom.

As shown in Fig. 1, the representative acoustic models $C_1$ to
$C_X$ belonging to the groups $G_1$ to $G_X$ are stored, group by
group, in the representative acoustic model storing unit 1a, and the
difference models $d_{1,1}$ to $d_{1,q_1}$, $d_{2,1}$ to $d_{2,q_2}$,
..., and $d_{X,1}$ to $d_{X,q_X}$ corresponding to each
representative acoustic model are stored in the difference model
storing unit 1b under each group.

Moreover, in Fig. 1, the $q_1$ difference models $d_{1,1}$ to
$d_{1,q_1}$ corresponding to the representative acoustic model $C_1$
of group $G_1$ are denoted by the symbol $D_1$, and the $q_2$
difference models $d_{2,1}$ to $d_{2,q_2}$ corresponding to the
representative acoustic model $C_2$ of group $G_2$ are denoted by the
symbol $D_2$. In the same manner, the $q_X$ difference models
$d_{X,1}$ to $d_{X,q_X}$ corresponding to the representative acoustic
model $C_X$ of group $G_X$ are denoted by the symbol $D_X$.

Furthermore, group information for managing the correspondence
between the representative acoustic models $C_1$ to $C_X$ and the
difference models $D_1$ to $D_X$ is stored in a group information
storing unit 1c.

Fig. 2 is a schematic diagram for illustrating a generation
principle of each representative acoustic model $C_1$ to $C_X$
corresponding to each group $G_1$ to $G_X$ and each difference model
$D_1$ to $D_X$ corresponding to each representative acoustic model
$C_1$ to $C_X$. The generation principle will be explained below
referring to Fig. 2.

Firstly, grouping or clustering the distributions S of the large
number of initial acoustic models (initial acoustic HMMs) generates
groups, each containing initial acoustic models similar to each
other, and also yields the above-mentioned group information.

Clustering methods such as the LBG method or the split method can
be used as the grouping method. The clustering is performed based on
the similarity of the mean vectors of each distribution of initial
acoustic models.
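The grouping by mean-vector similarity can be sketched as follows. This is only an illustrative sketch, not the method of the embodiment: it uses a plain nearest-centroid (Lloyd) iteration with squared Euclidean distance, whereas the LBG method additionally grows the set of centroids by splitting. The function names, the distance measure, and the caller-supplied initial centroids are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two mean vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_means(mean_vectors, initial_centroids, iterations=10):
    """Group distributions by the similarity of their mean vectors.

    Returns a list giving, for each mean vector, the index of the group
    it was assigned to (a simple form of the group information).
    """
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    assignment = [0] * len(mean_vectors)
    for _ in range(iterations):
        # Assign every mean vector to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda g: squared_distance(m, centroids[g]))
            for m in mean_vectors
        ]
        # Re-estimate each centroid from its current members.
        for g in range(len(centroids)):
            members = [m for m, a in zip(mean_vectors, assignment) if a == g]
            if members:
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

Calling `cluster_means` with the mean vectors of all initial distributions and a few starting centroids yields one group index per distribution.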

The grouping may also be performed by using prior information,
such as the similarity of the phonemes corresponding to each model.
For example, vowel models and consonant models may form two groups.

The grouping of initial acoustic models may be performed by
using the former and latter methods together. Such clustering
enables the grouping shown schematically in Fig. 2.

For example, among the acoustic models belonging to the Xth
group $G_X$, the first acoustic model, denoted by $S_{X,1}$, is a
distribution having mean vector $\mu S_{X,1}$ and covariance matrix
$\sigma d_{X,1}$ ($= \sigma S_{X,1}$), and the second acoustic model,
denoted by $S_{X,2}$, is a distribution having mean vector
$\mu S_{X,2}$ and covariance matrix $\sigma d_{X,2}$
($= \sigma S_{X,2}$). In the same manner, the $q_X$-th acoustic
model, denoted by $S_{X,q_X}$, is a distribution having mean vector
$\mu S_{X,q_X}$ and covariance matrix $\sigma d_{X,q_X}$
($= \sigma S_{X,q_X}$).

An acoustic model belonging to any of the other groups, such as
$G_1$ or $G_2$, is likewise a distribution having a mean vector and a
covariance matrix.

A method for obtaining each representative acoustic model $C_1$
to $C_X$ of each group $G_1$ to $G_X$ will be explained. The case of
obtaining the representative acoustic model $C_X$ of the Xth group
$G_X$ is explained below for convenience of explanation.

As shown in Fig. 2, a representative acoustic model $C_X$ is a
distribution having a mean vector $\mu C_X$ originating from the base
point Q and a covariance matrix $\sigma C_X$ (indicated by an ellipse
in Fig. 2) corresponding to the mean vector $\mu C_X$.

Assuming that a representative acoustic model $C_X$ is denoted
by $C_X(\mu C_X, \sigma C_X)$, the mean vector $\mu C_X$ can be
obtained as follows:

$$\mu C_X = \frac{1}{q_X} \sum_{y=1}^{q_X} \mu S_{X,y} \qquad (1)$$

The covariance matrix $\sigma C_X$ can also be obtained as follows:

$$\sigma C_X = \frac{1}{q_X} \sum_{y=1}^{q_X} \sigma S_{X,y} + \frac{1}{q_X} \sum_{y=1}^{q_X} (\mu S_{X,y} - \mu C_X)(\mu S_{X,y} - \mu C_X)^T \qquad (2)$$

In the above expressions (1) and (2), the variable X denotes the
Xth group $G_X$, the variable y denotes each acoustic model $S_{X,y}$
($1 \le y \le q_X$) belonging to group $G_X$, and the variable $q_X$
denotes the total number of acoustic models $S_{X,y}$ belonging to
group $G_X$.

Representative acoustic models of the other groups $G_1$, $G_2$,
etc. can also be obtained from the above expressions (1) and (2).
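Expressions (1) and (2) can be illustrated with a short sketch. It is a hedged example only: the function name `representative_model` and the representation of vectors and matrices as nested Python lists are assumptions chosen for readability, not part of the embodiment.

```python
def representative_model(means, covariances):
    """Compute the representative model of one group.

    means:       list of q mean vectors (the mean of each member S_{X,y})
    covariances: list of q covariance matrices (one per member)
    Returns (mu_c, sigma_c) following expressions (1) and (2).
    """
    q = len(means)
    dim = len(means[0])
    # Expression (1): the representative mean is the average member mean.
    mu_c = [sum(m[i] for m in means) / q for i in range(dim)]
    # Expression (2): average member covariance plus the scatter of the
    # member means around the representative mean.
    sigma_c = [[0.0] * dim for _ in range(dim)]
    for m, cov in zip(means, covariances):
        for i in range(dim):
            for j in range(dim):
                scatter = (m[i] - mu_c[i]) * (m[j] - mu_c[j])
                sigma_c[i][j] += (cov[i][j] + scatter) / q
    return mu_c, sigma_c
```

For two members with means (0, 0) and (2, 0) and identity covariances, the representative mean is (1, 0), and the covariance widens along the first axis because the member means are spread along it.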

Next, each difference model $D_1$ to $D_X$ corresponding to each
group $G_1$ to $G_X$ can be calculated by the following expressions
(3) and (4).

How to obtain the difference model $D_X$ (that is, $d_{X,1}$,
$d_{X,2}$, ..., $d_{X,q_X}$) corresponding to the Xth group $G_X$
shown in Fig. 2 will be explained for convenience.

The mean vector $\mu d_{X,y}$ can be obtained from

$$\mu d_{X,y} = \mu S_{X,y} - \mu C_X \qquad (3)$$

The covariance matrix $\sigma d_{X,y}$ can be determined by

$$\sigma d_{X,y} = \sigma S_{X,y} \qquad (4)$$

In the above expressions (3) and (4), the variable X denotes
the Xth group $G_X$, the variable y denotes each acoustic model
$S_{X,y}$ ($1 \le y \le q_X$) belonging to group $G_X$, and the
variable $q_X$ denotes the total number of acoustic models $S_{X,y}$
belonging to group $G_X$.

The mean vector $\mu d_{X,y}$ and the covariance matrix
$\sigma d_{X,y}$ determined by the above expressions (3) and (4)
compose the difference model $d_{X,y}$.

More specifically, the difference model $d_{X,1}$ is the
distribution with the mean vector $\mu d_{X,1}$ and the covariance
matrix $\sigma d_{X,1}$, and the difference model $d_{X,2}$ is the
distribution with the mean vector $\mu d_{X,2}$ and the covariance
matrix $\sigma d_{X,2}$. In the same manner, the difference model
$d_{X,y}$ ($y = q_X$) is the distribution with the mean vector
$\mu d_{X,y}$ and the covariance matrix $\sigma d_{X,y}$, and thus
all $q_X$ difference models $d_{X,1}$ to $d_{X,y}$ can be determined.

The representative acoustic models $C_1$ to $C_X$ and the
difference models $D_1$ ($d_{1,1}$ to $d_{1,q_1}$) to $D_X$
($d_{X,1}$ to $d_{X,q_X}$) are stored beforehand, in correspondence
with each group, in the representative acoustic model storing unit
1a and the difference model storing unit 1b, respectively.

As shown schematically in Fig. 3, in more general terms, the
initial acoustic model $S_{X,y}$ corresponding to the difference
model $d_{X,y}$ can be determined by composition of the yth
difference model $d_{X,y}$ belonging to the Xth group $G_X$ and the
representative acoustic model $C_X$ belonging to the same group as
the difference model $d_{X,y}$. On the basis of this relation, the
representative acoustic model $C_x$ ($1 \le x \le X$) and the
difference model $D_x$ ($1 \le x \le X$) corresponding to each group
$G_x$ ($1 \le x \le X$) are stored in the storing units 1a and 1b
respectively, and are managed in correspondence with each group
based on the stored group information.

In the present embodiment, the processing of the
above-mentioned composition is realized by the following expressions
(5) and (6):

$$\mu d_{X,y} + \mu C_X = \mu S_{X,y} \qquad (5)$$

$$\sigma d_{X,y} = \sigma S_{X,y} \qquad (6)$$

That is, the mean vector is obtained by addition, and the
covariance matrix is obtained simply by replacement.
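The relation between an initial acoustic model, its difference model, and the representative model can be sketched as a round trip. The helper names `make_difference_model` and `compose` are hypothetical, and vectors are plain Python lists; the sketch only mirrors expressions (3) to (6).

```python
def make_difference_model(mu_s, sigma_s, mu_c):
    """Expressions (3), (4): subtract the representative mean, keep the covariance."""
    mu_d = [s - c for s, c in zip(mu_s, mu_c)]
    return mu_d, sigma_s


def compose(mu_d, sigma_d, mu_c):
    """Expressions (5), (6): add the representative mean back, keep the covariance."""
    mu_s = [d + c for d, c in zip(mu_d, mu_c)]
    return mu_s, sigma_d
```

Composing a difference model with the same representative mean it was derived from returns the original distribution, which is why only the representative models and the difference models need to be stored.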

For convenience of explanation, each distribution $S_{X,y}$ of
the initial acoustic models has been identified as the yth
distribution of a group $G_X$. In reality, however, each
distribution of an initial acoustic model is associated with an HMM,
and each distribution of the difference models is likewise stored in
association with its HMM.

Group information B, which includes the relationship between
each distribution of the initial acoustic model corresponding to
each HMM and the group to which the distribution belongs, is stored
in a group information storing unit 1c.

For example, the distribution of the initial acoustic model
corresponding to the HMM number i, the state number j and the
mixture number k is denoted by $S^m_{i,j,k}$, and the difference
model corresponding to this distribution is denoted by
$d^m_{i,j,k}$. Furthermore, when the cluster to which the
distribution $S^m_{i,j,k}$ and the difference model $d^m_{i,j,k}$
belong is denoted by $\beta$, the group information $B^m_{i,j,k}$
indicating the group to which the distribution $S^m_{i,j,k}$ belongs
is given by

$$B^m_{i,j,k} = \beta \qquad (7)$$

Thus, the correspondence among the initial acoustic models, the
difference models, and the group to which these models belong can be
obtained from the cluster information $B^m$.
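One possible way to hold such group information is a lookup table keyed by the HMM number, state number and mixture number. This is an illustrative assumption only: the source does not specify the storage format, and the names `group_info`, `group_of` and `members_of` as well as the sample indices are hypothetical.

```python
# Hypothetical group information: (HMM number i, state number j,
# mixture number k) -> index of the group the distribution belongs to.
group_info = {
    (0, 0, 0): 0,
    (0, 0, 1): 0,
    (1, 2, 0): 1,
}


def group_of(info, i, j, k):
    """Return the group to which distribution (i, j, k) belongs."""
    return info[(i, j, k)]


def members_of(info, group):
    """Return, in sorted order, every distribution index belonging to a group."""
    return sorted(key for key, g in info.items() if g == group)
```

With such a table, the representative model to be combined with a given difference model is found through `group_of`, and all difference models of a group are found through `members_of`.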



When the noise adaptive representative acoustic model generating
section 3 employs the Jacobian adaptation method as the noise
adaptation method, the representative acoustic model C of each group
is replaced by, and stored as, an initial composite acoustic model
obtained by HMM composition of the initial noise model (denoted by
$N_S$), formed beforehand, and the representative acoustic model of
each group.

The Jacobian matrix J of each group, which is determined from
the initial noise model $N_S$ and the renewed representative
acoustic model C, and the initial noise model $N_S$ are each stored
and supplied to the noise adaptive representative acoustic model
generating section 3.

An uttered speech environment noise model generating section
2 generates uttered speech environment noise models (uttered speech
environment noise HMMs) N based on the background noise of the
speech environment during the non-uttered speech period.

During the non-uttered period, when the speaker has not yet
spoken, the background noise of the speech environment is collected
through a microphone 8. The speech analyzing section 9 generates the
feature vector series V(n) of the background noise for every
predetermined frame period from the collected signal v(t). By
switching the switch 10, the feature vector series V(n) is applied
to the uttered speech environment noise model generating section 2
as the feature vector series N(n)' of the background noise. Then,
the uttered speech environment noise model generating section 2
generates the uttered speech environment noise model N by learning
the feature vector series N(n)'.
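A minimal sketch of learning a noise model from the background-noise frames is given below. It assumes the simplest case, a single Gaussian (1 state, 1 mixture) with a diagonal covariance estimated per dimension; the diagonal form and the function name `estimate_noise_model` are assumptions made for the example and are not stated in the source.

```python
def estimate_noise_model(frames):
    """Estimate a single-Gaussian noise model from feature vectors.

    frames: list of cepstral feature vectors collected while nobody speaks.
    Returns (mean, variance) with one variance value per dimension.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    variance = [
        sum((f[i] - mean[i]) ** 2 for f in frames) / n for i in range(dim)
    ]
    return mean, variance
```

A stationary background noise is then summarised by one mean vector and one variance vector, which is the form of noise model assumed in the simplest noise-adaptation case.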



The noise adaptive representative acoustic model generating
section 3 generates noise adaptive representative acoustic models
(noise adaptive representative acoustic HMMs) $C_1^N$ to $C_X^N$
corresponding to each group $G_1$ to $G_X$ by noise-adaptation of
the representative acoustic models $C_1$ to $C_X$ to the uttered
speech environment noise model N, and then feeds them to the
composite acoustic model generating section 4.

The noise-adaptation superposes the uttered speech environment
noise model upon the distribution of each representative acoustic
model, using HMM composition, the Jacobian adaptation method or the
like.

The HMM composition calculates the noise adaptive
representative acoustic model $C_X^N$ of each group using the
uttered speech environment noise model N and the representative
acoustic model $C_X$ of each group.

The Jacobian adaptation method calculates the noise adaptive
representative acoustic model $C_X^N$ using the representative
acoustic model $C_X$ of each group, which has been renewed by the
initial composite model, the initial noise model $N_S$, the uttered
speech environment noise model N and the Jacobian matrix J of each
group.
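The way these four quantities are commonly combined in Jacobian adaptation is a first-order update of the mean vector: the mean of the initial composite model is shifted by the Jacobian matrix applied to the change in the noise mean. The sketch below shows that generic update under this assumption; the source does not give the formula, so the update rule, the function name `jacobian_adapt_mean` and its parameter names are illustrative only.

```python
def jacobian_adapt_mean(mu_c_initial, jacobian, mu_noise, mu_noise_initial):
    """First-order noise adaptation of a representative mean vector.

    mu_c_initial:     mean of the representative model already composed
                      with the initial noise model N_S
    jacobian:         Jacobian matrix J of the group (list of rows)
    mu_noise:         mean of the current environment noise model N
    mu_noise_initial: mean of the initial noise model N_S
    """
    delta = [n - ns for n, ns in zip(mu_noise, mu_noise_initial)]
    shift = [sum(j * d for j, d in zip(row, delta)) for row in jacobian]
    return [c + s for c, s in zip(mu_c_initial, shift)]
```

Because the update is one matrix-vector product per representative model, its cost grows with the number of groups rather than with the number of initial acoustic models.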

The noise-adaptation of the representative acoustic model $C_X$
of each group $G_X$ will be described hereinafter more generally.
When the background noise is assumed to be stationary and the noise
model N to be a model with 1 state and 1 mixture, a representative
acoustic model $C_X$ is adapted to noise to become the noise
adaptive representative acoustic model $C_X^N$ by noise-adaptation
processing such as the HMM composition scheme or the Jacobian
adaptation method. The mean vector and covariance matrix of the
representative acoustic model are transformed to $\mu C_X^N$ and
$\sigma C_X^N$, respectively.

When the noise model N is a model with at least 2 states and at
least 2 mixtures, the representative acoustic model $C_X$
corresponds to at least two noise adaptive distributions; that is,
the representative acoustic model $C_X$ corresponds to $C_{X,1}^N$,
$C_{X,2}^N$, ...


The composite acoustic model generating section 4 generates
a plurality of composite acoustic models (composite acoustic HMMs)
M by the composition of each difference model (denoted by D in
Fig. 1) stored in the difference model storing unit 1b and each
noise adaptive representative acoustic model (denoted by $C^N$ in
Fig. 1) with regard to each group $G_1$ to $G_X$.

More generally described, the noise adaptive representative
acoustic model generating section 3 generates noise adaptive
representative acoustic models $C_x^N$ ($1 \le x \le X$)
corresponding to each group $G_x$ ($1 \le x \le X$); then, the
composite acoustic model generating section 4 generates $q_x$
composite acoustic models $M_{x,1}$ to $M_{x,y}$, which are
equivalent to noise-adapted versions of the initial acoustic models
$S_{x,1}$ to $S_{x,y}$, by the composition of each difference model
$d_{x,1}$ to $d_{x,y}$ ($y = q_x$) and each noise adaptive
representative acoustic model $C_x^N$ ($1 \le x \le X$).

Fig. 4 is a schematic drawing illustrating the structure of
a set of composite acoustic models M generated as described above.
As a representative example, the structure of the composite acoustic
models $M_{1,1}$ to $M_{1,y}$ generated from the representative
acoustic model $C_1$ and the difference models $d_{1,1}$ to
$d_{1,y}$ ($y = q_1$) belonging to the group $G_1$ is shown.

In Fig. 4, the above-mentioned composition is illustrated
without the effect of the covariance matrix, for easy understanding.
5 The mean vector and the covariance ofasetof composite acoustic 

models M x , y are denoted by \x M x , y and a M x , y , respectively. In the case 
of the composition of the noise adaptive representative acoustic 
model and the difference model, when the change in the variance 
of representative acoustic models caused by noise-adaptation is 
10 not taken into consideration, the mean vector ju M x , y and the covariance 
matrix oM x , y of the set of composite acoustic models M x , y are calculated 
by the following expressions; 

AiM x , y = Md x , y + mC x \ (8) 

aM x , y = ad X/y . - — (9) 
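As an illustration of expressions (8) and (9), the following is a minimal sketch, not the implementation of the embodiment. It assumes each distribution is a single Gaussian with a diagonal covariance held as a plain list; the names `Gaussian`, `compose`, `c_n` and `diffs` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    mean: List[float]  # mean vector (mu)
    var: List[float]   # diagonal of the covariance matrix (assumption)


def compose(diff: Gaussian, rep_noise: Gaussian) -> Gaussian:
    """Compose one difference model d_{x,y} with C_x^N.

    Expression (8): the means are added.
    Expression (9): the covariance is taken from the difference model.
    """
    mean = [d + c for d, c in zip(diff.mean, rep_noise.mean)]
    return Gaussian(mean=mean, var=list(diff.var))


# One noise adaptive representative model and two difference models
# of the same group (toy values).
c_n = Gaussian(mean=[1.0, 2.0], var=[0.5, 0.5])
diffs = [
    Gaussian(mean=[0.25, -0.5], var=[0.3, 0.4]),
    Gaussian(mean=[-0.5, 0.75], var=[0.2, 0.6]),
]
composites = [compose(d, c_n) for d in diffs]
```

Only one noise-adaptation per group is needed in this scheme; each member of the group is then obtained by a vector addition.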

On the other hand, when the change in the variance of the
representative acoustic models caused by the noise-adaptation is
taken into consideration, the mean vector μM_{x,y} and the covariance
matrix σM_{x,y} are calculated by the following expressions:

$$\mu M_{x,y} = \mu d_{x,y} + (\sigma C_x^N)^{1/2} (\sigma C_x)^{-1/2} \mu C_x^N \tag{10}$$

$$\sigma M_{x,y} = \sigma C_x^N (\sigma C_x)^{-1} \sigma d_{x,y} \tag{11}$$
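Expressions (10) and (11) can be sketched as follows under the simplifying assumption of diagonal covariance matrices, so that the matrix square root and inverse reduce to element-wise operations. The function name `compose_with_variance` and its argument names are hypothetical.

```python
import math
from typing import List, Tuple


def compose_with_variance(
    mu_d: List[float], var_d: List[float],
    mu_cn: List[float], var_cn: List[float],
    var_c: List[float],
) -> Tuple[List[float], List[float]]:
    """Expressions (10) and (11) for diagonal covariances (assumption).

    scale_i = var_cn_i / var_c_i is the per-dimension variance change
    of the representative model caused by noise-adaptation.
    """
    scale = [vn / vc for vn, vc in zip(var_cn, var_c)]
    # (10): mu_M = mu_d + sqrt(var_CN) * var_C^(-1/2) * mu_CN
    mean = [d + math.sqrt(s) * m for d, s, m in zip(mu_d, scale, mu_cn)]
    # (11): var_M = var_CN * var_C^(-1) * var_d
    var = [s * v for s, v in zip(scale, var_d)]
    return mean, var


mean, var = compose_with_variance(
    mu_d=[0.5, 0.5], var_d=[0.25, 0.5],
    mu_cn=[1.0, 2.0], var_cn=[4.0, 1.0],
    var_c=[1.0, 1.0],
)
```

When the variance of the representative model is unchanged by noise-adaptation, the scale factor is 1 and the result coincides with expressions (8) and (9).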

Because the mean vector μM_{x,y} of the distribution is the most
influential factor in speech recognition performance, the mean
vector μM_{x,y} and the covariance matrix σM_{x,y} of the composite
acoustic model M_{x,y} are determined by expressions (8) and (9),
respectively, neither of which includes the change in the covariance
matrix of the representative acoustic model due to noise-adaptation.
In the present embodiment, the mean vector μM_{x,y} and the covariance
matrix σM_{x,y} of the set of composite acoustic models M_{x,y} are
calculated by the above expressions (8) and (9), thereby reducing the
amount of calculation while still obtaining the noise-adaptation
performance.

The sets of difference models D_1 (d_{1,1}~d_{1,q1}),
D_2 (d_{2,1}~d_{2,q2}), ..., D_X (d_{X,1}~d_{X,qX}) stored in the
difference model storing unit 1b are renewed by renewal difference
models, which are generated using the renewal model generating
section 5 and the model renewal section 6. A detailed description
is given below.

As shown in Fig. 1, the difference model before renewal is
denoted by D, and the difference model after renewal is denoted
by D". The composite acoustic model composed of the difference
model D before renewal and the noise adaptive representative acoustic
model C^N is denoted by M, and that composed of the renewed difference
model D" and the noise adaptive representative acoustic model C^N
is denoted by M".

The renewal model generating section 5 generates a noise and
speaker adaptive acoustic model (noise and speaker adaptive acoustic
HMM) R by the speaker-adaptation of the composite acoustic model M to
the feature vector series V(n), using a speaker adaptation method
such as the MLLR method or the MAP method.

The speaker-adaptation of the present embodiment makes use
of the speaker's utterance of text sentences or the like suitable
for the speaker-adaptation.

The feature vector series V(n), which is obtained for every
predetermined frame period and carries the characteristics of the
uttered speech, is output from the speech analyzing section 9, which
analyzes the speech picked up by the microphone 8 during the utterance
period. It is fed to the renewal model generating section 5 by
changing over the switch 10, as shown by a dotted line in Fig. 1.
The composite acoustic model M generated in the composite acoustic
model generating section 4 is also supplied to the renewal model
generating section 5 through the other dotted-line route in
Fig. 1. Then, the renewal model generating section 5 generates
a noise and speaker adaptive acoustic model R by the
speaker-adaptation of the composite acoustic model M to the feature
vector series V(n).

Fig. 5 is a schematic drawing illustrating the principle of
generating the noise and speaker adaptive acoustic model R adapted
to both noise and speaker. As a typical example, the generation
of the noise and speaker adaptive acoustic models R_{x,1}~R_{x,y}
from the composite acoustic models M_{x,1}~M_{x,y}, which are composed
of the representative acoustic model C_x of the group G_x and the
difference models D_x (d_{x,1}~d_{x,y}) on the basis of expressions
(8) and (9), is illustrated. The covariance matrix is not illustrated,
for simplicity of explanation.

The noise and speaker adaptive acoustic model R_{x,1}, having a
distribution with a mean vector μR_{x,1} and a covariance matrix
σR_{x,1} (omitted in this figure), is generated by using the
calculation of the expressions (8) and (9). In the same manner, the
noise and speaker adaptive acoustic model R_{x,y}, having a
distribution with a mean vector μR_{x,y} and a covariance matrix
σR_{x,y} (not shown), is generated.

Furthermore, the other noise and speaker adaptive acoustic
models corresponding to the groups G_1, G_2, ... are generated
by using the expressions (8) and (9), and all the noise and speaker
adaptive acoustic models R are supplied to the model renewal section
6.

The model renewal section 6 generates the renewal difference
model D" adapted to the speaker by using the noise and speaker
adaptive acoustic model R generated at the renewal model generating
section 5, the noise adaptive representative acoustic model C^N
generated at the noise adaptive representative acoustic model
generating section 3, and the difference model D before renewal
stored in the difference model storing unit 1b, and renews the
difference model D before renewal with the renewal difference
model D".

The principle of generating the renewal difference model D_x",
which is determined by the noise and speaker adaptive acoustic model
R_x of the group G_x, the noise adaptive representative acoustic model
C_x^N, and the difference model D_x before renewal, will be explained.
The mean vectors μd_{x,1}"~μd_{x,y}" and the covariance matrices
σd_{x,1}"~σd_{x,y}" of the renewal difference model D_x"
(d_{x,1}"~d_{x,y}") can be determined by the following expressions:

$$\mu d''_{x,y} = \alpha_{x,y} \bigl( \mu R_{x,y} - (\sigma C_x^N)^{1/2} (\sigma C_x)^{-1/2} \mu C_x^N \bigr) + (1 - \alpha_{x,y}) \, \mu d_{x,y} \tag{12}$$

$$\sigma d''_{x,y} = \alpha_{x,y} \bigl( \sigma C_x^N (\sigma C_x)^{-1} \sigma R_{x,y} \bigr) + (1 - \alpha_{x,y}) \, \sigma d_{x,y} \tag{13}$$

The above expressions (12) and (13) show the method used when the
noise-adaptation of the covariance matrices is performed. When the
noise-adaptation of the covariance matrices is not performed, the
mean vectors and the covariance matrices can be determined by the
following expressions:

$$\mu d''_{x,y} = \alpha_{x,y} \bigl( \mu R_{x,y} - \mu C_x^N \bigr) + (1 - \alpha_{x,y}) \, \mu d_{x,y} \tag{14}$$

$$\sigma d''_{x,y} = \alpha_{x,y} \, \sigma R_{x,y} + (1 - \alpha_{x,y}) \, \sigma d_{x,y} \tag{15}$$
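Expressions (14) and (15) amount to an interpolation between the old difference model and the one implied by the adapted model. The sketch below assumes diagonal covariances stored as lists and a weighting coefficient between 0 and 1; `renew_difference` and its argument names are hypothetical.

```python
from typing import List, Tuple


def renew_difference(
    mu_r: List[float], var_r: List[float],
    mu_cn: List[float],
    mu_d: List[float], var_d: List[float],
    alpha: float,
) -> Tuple[List[float], List[float]]:
    """Expressions (14) and (15) with diagonal covariances (assumption)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # (14): mu_d" = alpha * (mu_R - mu_CN) + (1 - alpha) * mu_d
    new_mu = [alpha * (r - c) + (1.0 - alpha) * d
              for r, c, d in zip(mu_r, mu_cn, mu_d)]
    # (15): var_d" = alpha * var_R + (1 - alpha) * var_d
    new_var = [alpha * vr + (1.0 - alpha) * vd
               for vr, vd in zip(var_r, var_d)]
    return new_mu, new_var


new_mu, new_var = renew_difference(
    mu_r=[3.0, 1.0], var_r=[2.0, 4.0],
    mu_cn=[1.0, 2.0],
    mu_d=[1.0, 0.0], var_d=[1.0, 2.0],
    alpha=0.5,
)
```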

Furthermore, when the speaker-adaptation of the covariance
matrices is also not performed, the mean vectors and the covariance
matrices can be determined by the following expressions:

$$\mu d''_{x,y} = \alpha_{x,y} \bigl( \mu R_{x,y} - \mu C_x^N \bigr) + (1 - \alpha_{x,y}) \, \mu d_{x,y} \tag{16}$$

$$\sigma d''_{x,y} = \sigma d_{x,y} \tag{17}$$

In speaker-adaptation, the adaptation effect on the mean vector
is large, but the adaptation effect on the covariance matrix is
small. The above expressions (16) and (17), which apply when the
speaker-adaptation of the covariance matrix is not performed, can
therefore be used to determine the mean vectors μd_{x,1}"~μd_{x,y}"
and the covariance matrices σd_{x,1}"~σd_{x,y}" of the renewal
difference models d_{x,1}"~d_{x,y}", thereby reducing the amount of
computation while still obtaining the effect of the
speaker-adaptation. Thus, the present embodiment determines the
renewal difference models d_{x,1}"~d_{x,y}" based on the above
expressions (16) and (17).

In addition, the coefficient α_{x,y} in the expressions (16), (17)
is a weighting coefficient for adjusting the renewal difference model
d_{x,y}" obtained from the noise and speaker adaptive acoustic model
R_{x,y} and the composite acoustic model M_{x,y}, and its range is
0.0 ≤ α_{x,y} ≤ 1.0.

The weighting coefficient α_{x,y} may be a predetermined value
in the above-mentioned range, or may be changed at every adaptation
process like the weighting coefficient of the MAP estimation method.
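The following sketch shows expressions (16) and (17) together with one possible way of varying the weighting coefficient between adaptation passes. The count-based rule `frame_count / (frame_count + tau)` is only an assumption modelled on the usual form of a MAP weight; the specification does not give a formula, and `map_like_alpha`, `tau` and `renew_mean_only` are hypothetical names.

```python
from typing import List, Tuple


def map_like_alpha(frame_count: int, tau: float = 10.0) -> float:
    """Hypothetical MAP-style weight: approaches 1.0 as data accumulates."""
    return frame_count / (frame_count + tau)


def renew_mean_only(
    mu_r: List[float], mu_cn: List[float],
    mu_d: List[float], var_d: List[float],
    alpha: float,
) -> Tuple[List[float], List[float]]:
    """Expression (16) for the mean; expression (17) keeps the covariance."""
    new_mu = [alpha * (r - c) + (1.0 - alpha) * d
              for r, c, d in zip(mu_r, mu_cn, mu_d)]
    return new_mu, list(var_d)


alpha = map_like_alpha(frame_count=30)  # 30 / (30 + 10) = 0.75
new_mu, new_var = renew_mean_only(
    mu_r=[3.0, 1.0], mu_cn=[1.0, 2.0],
    mu_d=[1.0, 0.0], var_d=[0.5, 0.5],
    alpha=alpha,
)
```

With alpha equal to 0.0 the stored difference model is left unchanged, and with alpha equal to 1.0 it becomes exactly the difference between the adapted mean and the noise adaptive representative mean.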

The renewal difference model d_{x,1}" of the group G_x is obtained
as a distribution with the mean vector μd_{x,1}", which is determined
by the vector sum of the vector α_{x,1}·(μR_{x,1} − μC_x^N) of the
first term on the right side of the expression (16) and the vector
(1−α_{x,1})·μd_{x,1} of the second term, and with the covariance
matrix σd_{x,1}" determined by the expression (17), as shown in
Fig. 5. The other renewal difference models can be determined in
the same manner.

The model renewal section 6 determines the renewal difference
models D_1"~D_X" corresponding to the groups G_1~G_X, and renews the
difference models D_1~D_X before renewal with the renewal difference
models D_1"~D_X".

After the difference model storing unit 1b has been renewed with
the renewal difference model D", the recognition processing section
7 recognizes the uttered speech of the speaker once real speech
recognition begins.

Before speech is uttered after the beginning of the speech
recognition processing, the composite acoustic model generating
section 4 generates the composite acoustic models M" adapted to both
noise and speaker for all the groups G_1~G_X by composing the noise
adaptive representative acoustic model C^N generated in the noise
adaptive representative acoustic model generating section 3 with
the renewal difference model D".

Next, during the period of speech utterance, the speech
analyzing section 9 generates the feature vector series V(n) of the
uttered speech including the background noise, and supplies the
feature vector series V(n) to the recognition processing section 7
by changing over the switch 10.

The recognition processing section 7 compares the feature
vector series V(n) with the word or sentence model series generated
from the composite acoustic model M", and outputs the model series
with the maximum likelihood as the recognition result.

The operation of the speech recognition apparatus will be
explained below with reference to the flowcharts in Fig. 6 and
Fig. 7.

Fig. 6 shows the operation for renewing the difference model
D with the renewal difference model D", which is performed before
the steps of recognizing speech. Fig. 7 shows the operation for
recognizing speech using the renewal difference model D".

As shown in Fig. 6, when the renewal processing begins, first,
at the step S100, the noise adaptive representative acoustic model
generating section 3 generates the noise adaptive representative
acoustic model C^N by adapting the representative acoustic model C
to noise.

More specifically, the speech analyzing section 9 supplies
the feature vector series N(n)' of the background noise during a
non-utterance period to the uttered environment noise model
generating section 2, where the uttered environment noise model
N is generated by learning the feature vector series N(n)'.

Then, the noise adaptive representative acoustic model
generating section 3 generates the noise adaptive representative
acoustic model C^N by the noise-adaptation of the representative
acoustic model C to the uttered environment noise model N.

At the next step S102, the composite acoustic model generating
section 4 generates the composite acoustic model M by composing
the noise adaptive representative acoustic model C^N with the
difference model D before renewal.

Thus, at the step S102, the composite acoustic model M is adapted
only to noise, and is not yet adapted to the speaker.

At the step S104, the renewal model generating section 5
adapts the composite acoustic model M to the uttered speech of
the speaker.

That is, while the speaker utters text sentences or the like,
the speech analyzing section 9 supplies the feature vector series
V(n) of the uttered speech to the renewal model generating section
5 by changing over the switch 10. Then, the renewal model
generating section 5 generates the noise and speaker adaptive
acoustic model R by the speaker-adaptation of the composite acoustic
model M to the feature vector series V(n).

Thus, the noise and speaker adaptive acoustic model R adapted
to both noise and speaker is generated at the step S104, as shown
in Fig. 5.

At the next step S106, the model renewal section 6 generates
the renewal difference model D" adapted to noise and speaker by
using the noise and speaker adaptive acoustic model R, the noise
adaptive representative acoustic model C^N, and the difference model
D before renewal.

At the next step S108, the model renewal section 6 renews the
difference model D (before renewal) in the difference model storing
unit 1b with the renewal difference model D", and the renewal
processing is finished.
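The steps S100 to S108 can be sketched as a single pass over the groups. This is a simplified illustration that handles mean vectors only. The callables `noise_adapt` and `speaker_adapt` are hypothetical placeholders standing in for the noise-adaptation method and for the MLLR or MAP speaker-adaptation; the toy lambdas used below are not real adaptation algorithms.

```python
from typing import Callable, Dict, List

Vector = List[float]


def renewal_pass(
    rep_means: Dict[str, Vector],
    diff_means: Dict[str, List[Vector]],
    noise_adapt: Callable[[Vector], Vector],
    speaker_adapt: Callable[[Vector], Vector],
    alpha: float = 1.0,
) -> Dict[str, List[Vector]]:
    """Return renewed difference means D" for every group (means only)."""
    renewed: Dict[str, List[Vector]] = {}
    for group, c in rep_means.items():
        c_n = noise_adapt(c)  # S100: once per group
        new_diffs = []
        for d in diff_means[group]:
            m = [di + ci for di, ci in zip(d, c_n)]  # S102: expression (8)
            r = speaker_adapt(m)                     # S104: model R
            new_diffs.append(                        # S106: expression (16)
                [alpha * (ri - ci) + (1.0 - alpha) * di
                 for ri, ci, di in zip(r, c_n, d)]
            )
        renewed[group] = new_diffs
    return renewed  # S108: the caller overwrites the stored D with this


# Toy stand-ins: shift by a constant instead of real adaptation.
renewed = renewal_pass(
    rep_means={"G1": [0.0, 0.0]},
    diff_means={"G1": [[1.0, 2.0]]},
    noise_adapt=lambda c: [v + 1.0 for v in c],
    speaker_adapt=lambda m: [v + 0.5 for v in m],
    alpha=1.0,
)
```

In this toy run with alpha equal to 1.0, composing the renewed difference mean with the noise adaptive representative mean reproduces the speaker adapted mean, which is the property the renewal relies on.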

As mentioned above, the embodiment of the present invention
does not apply noise and speaker adaptation to each initial acoustic
model individually, but applies the noise-adaptation only to the
representative acoustic model C to generate the noise adaptive
representative acoustic model C^N. The composite acoustic models M,
generated by composing the noise adaptive representative acoustic
models C^N with the difference models D, are then used in the
speaker-adaptation, so that the amount of processing for adapting
to noise and speaker can be remarkably reduced.

In the renewal processing, the renewal difference model D"
adapted to noise and speaker is generated and stored in the
difference model storing unit 1b as a replacement for the old
difference model. This also remarkably reduces the amount of
processing in speech recognition, as described below, so that
rapid speech recognition becomes possible.

Next, the operation for recognizing speech will be explained
with reference to Fig. 7.

In the speech recognition apparatus, the speech recognition
processing starts on receiving a command from the speaker. At the
step S200 in Fig. 7, the noise adaptive representative acoustic
model generating section 3 generates the noise adaptive
representative acoustic model C^N by the noise-adaptation of the
representative acoustic model C.

More specifically, during the non-utterance period (in which the
speaker has not yet uttered anything), the uttered environment noise
model generating section 2 generates the uttered environment noise
model N by learning the feature vector series N(n)' of the background
noise from the speech analyzing section 9. Then, the noise adaptive
representative acoustic model generating section 3 generates the
noise adaptive representative acoustic model C^N by the
noise-adaptation of the representative acoustic model C to the
uttered environment noise model N.



At the step S202, the composite acoustic model generating
section 4 generates the composite acoustic model M" adapted to noise
and speaker by composing the noise adaptive representative
acoustic model C^N with the renewal difference model D".

Then, at the step S204, the recognition processing section
7 compares the feature vector series V(n) of the uttered speech
with the word or sentence model generated from the composite acoustic
model M", to recognize the uttered speech.

That is, when the speaker begins to utter speech, the switch
10 is connected to the recognition processing section 7, and the
feature vector series V(n) of the uttered speech including the
background noise, which is output from the speech analyzing section
9, is supplied to the recognition processing section 7. Then, the
recognition processing section 7 compares the feature vector series
V(n) with the word or sentence model series generated from the
composite acoustic model M". Next, the model series of the composite
acoustic model M" with the maximum likelihood, which corresponds
to the word or sentence, is output as the speech recognition result
at the step S206.
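The comparison at the steps S204 and S206 selects the candidate with the maximum likelihood. The sketch below is a deliberately reduced illustration: each word is represented by a single diagonal-covariance Gaussian rather than by a series of HMMs, so no state alignment is performed. The names `log_gauss`, `recognize` and the candidate word "kyoto" are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def log_gauss(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def recognize(
    frames: List[Vector],
    word_models: Dict[str, Tuple[Vector, Vector]],
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate on all frames and return the best one."""
    scores = {
        word: sum(log_gauss(f, mean, var) for f in frames)
        for word, (mean, var) in word_models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Composite models M" reduced to one Gaussian per word (assumption).
word_models = {
    "tokyo": ([1.0, 1.0], [1.0, 1.0]),
    "kyoto": ([-1.0, -1.0], [1.0, 1.0]),
}
frames = [[0.9, 1.1], [1.2, 0.8]]  # stand-in for V(n)
best, scores = recognize(frames, word_models)
```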

As mentioned above, the embodiment of the present invention
does not apply noise and speaker adaptation individually to the
so-called initial acoustic models, but generates the composite
acoustic models M" adapted to noise and speaker by composing
the noise adaptive representative acoustic models C^N with the
renewal difference models D". As a result, the amount of processing
for adapting to noise and speaker can be greatly reduced.

In conventional speech recognition, speaker-adaptation is
accompanied by adaptation to the environmental noise of the uttered
speech, and thus an acoustic model that should be adapted only to
the speaker inevitably includes the effect of adapting to the
environmental noise. That is, an acoustic model including both the
speaker-adaptation and the noise-adaptation in full is compared with
the feature vector series V(n) of the uttered speech. As a result,
improvement of the speech recognition rate is hindered.

In the present embodiment, however, the renewal difference model
D" is generated from the acoustic model adapted to the speaker. Since
the composite acoustic model M" to be compared is generated from
the renewal difference model D", the effect of the noise-adaptation
can be decreased. Thus, the synergistic effect of the noise and
speaker adaptation can be obtained to achieve a higher speech
recognition rate.

(Second Embodiment)

The second embodiment of the present invention will be
explained hereinafter with reference to Fig. 8 and Fig. 9. Fig. 8 is
a drawing illustrating the structure of the speech recognition
apparatus of the present embodiment. In Fig. 8, members having the
same function as those in Fig. 1 are given the same reference
numerals and codes.

The difference between the speech recognition apparatus of
the second embodiment and that of the first embodiment is as follows.
In the speech recognition apparatus of the first embodiment, the
speech recognition is performed after the generation of the renewal
difference model D" adapted to noise and speaker, as explained
with reference to the flowcharts in Fig. 6 and Fig. 7. On the other
hand, the speech recognition apparatus of the present embodiment
executes the speech recognition and the generation of the renewal
difference model D" simultaneously, through the renewal processing
of the renewal model generating section 5 and the model renewal
section 6.

The operation of the speech recognition apparatus will be
explained with reference to the flowchart of Fig. 9.

As shown in Fig. 9, when the speech recognition processing
begins, first, at the step S300, the noise adaptive representative
acoustic model generating section 3 generates the noise adaptive
representative acoustic model C^N by adapting the representative
acoustic model C to noise.

That is, the speech analyzing section 9 supplies the feature
vector series N(n)' of the background noise during a non-utterance
period to the uttered environment noise model generating section 2,
where the uttered environment noise model N is generated by
learning the feature vector series N(n)'.

Then, the noise adaptive representative acoustic model
generating section 3 generates the noise adaptive representative
acoustic model C^N by the noise-adaptation of the representative
acoustic model C to the uttered environment noise model N.

At the next step S302, the composite acoustic model generating
section 4 generates the composite acoustic model M by composing
the noise adaptive representative acoustic model C^N with the
difference model D before renewal.

Then, at the step S304, the recognition processing section
7 compares the feature vector series V(n) of the uttered speech
with the word or sentence model generated from the composite acoustic
model M, to recognize the uttered speech.

That is, when the speaker begins to utter speech, the switch
10 is connected to the recognition processing section 7, and the
feature vector series V(n) of the uttered speech generated in the
speech analyzing section 9 is supplied to the recognition processing
section 7. The recognition processing section 7 compares the feature
vector series V(n) with the model series, such as a word or sentence,
generated from the composite acoustic model M, and outputs the model
series of the composite acoustic model M with the maximum likelihood
as a speech recognition result RCG.

At the step S306, the likelihood values of the top-ranked
candidates for the recognition result are also output, and the
reliability of the recognition result is determined by comparing
them with a predetermined standard.

At the next step S308, whether the recognition result is correct
or incorrect is determined. If it is correct, the processing proceeds
to the step S310; if not, the processing jumps to the end.
Various methods for determining the reliability of a recognition
result have been developed, but their explanation is omitted here.
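Since the specification leaves the reliability criterion open, the following is only one conceivable check for the steps S306 and S308: the result is accepted when the log-likelihood of the best candidate exceeds that of the runner-up by a margin. The function `is_reliable` and the threshold `min_margin` are hypothetical and are not taken from the specification.

```python
from typing import Dict


def is_reliable(scores: Dict[str, float], min_margin: float = 5.0) -> bool:
    """Accept the top candidate only if it beats the runner-up clearly.

    scores maps each candidate to its log-likelihood. With fewer than
    two candidates no margin exists, so the result is treated as
    unreliable and the model renewal would be skipped.
    """
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) >= min_margin


clear_result = is_reliable({"tokyo": -10.0, "kyoto": -30.0})
close_result = is_reliable({"tokyo": -10.0, "kyoto": -12.0})
```

Gating the renewal in this way keeps a misrecognized utterance from being used to adapt the wrong difference models.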

At the steps S310 and S312, the renewal model generating section
5 performs the speaker-adaptation using the composite acoustic
model M, the feature vector series V(n) of the uttered speech, and
the recognition result RCG. Then, the model renewal section 6
generates the renewal difference model D", and renews the difference
model D before renewal.

That is, at the step S310, the renewal model generating section
5 determines the recognized model series using the recognition
result RCG, and performs the speaker-adaptation of the composite
acoustic model M using the feature vector series V(n).

For example, when a speaker utters "Tokyo" and the recognition
result of the word "Tokyo" is output from the recognition processing
section 7, the renewal model generating section 5 performs the
speaker-adaptation of the composite acoustic model M of the word
"Tokyo" using the feature vector series V(n) of the uttered word
"Tokyo", so that the noise and speaker adaptive acoustic model R
adapted to noise and speaker can be generated.

Furthermore, the model renewal section 6 generates the renewal
difference model D" corresponding to the recognition result RCG
using the noise and speaker adaptive acoustic model R, the noise
adaptive representative acoustic model C^N, and the difference model
D before renewal.

At the step S312, the model renewal section 6 replaces the
difference model D (before renewal) corresponding to the recognition
result RCG with the renewal difference model D".

When the recognition result RCG is the word "Tokyo" as mentioned
above, the difference model D before renewal of the word "Tokyo"
is renewed by the renewal difference model D".
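The per-utterance renewal of the second embodiment touches only the difference models belonging to the recognized result. The sketch below is a simplification: it stores one difference mean per word instead of one per HMM distribution, and it applies the mean update of expression (16). The names `online_update` and `diff_store`, and the entry "kyoto", are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def online_update(
    diff_store: Dict[str, Vector],
    word: str,
    mu_r: Vector,
    mu_cn: Vector,
    alpha: float = 0.5,
) -> None:
    """Renew, in place, only the difference mean of the recognized word."""
    mu_d = diff_store[word]
    diff_store[word] = [
        alpha * (r - c) + (1.0 - alpha) * d  # expression (16)
        for r, c, d in zip(mu_r, mu_cn, mu_d)
    ]


diff_store = {"tokyo": [1.0, 0.0], "kyoto": [0.0, 1.0]}
# RCG = "tokyo": mu_r is the mean of the adapted model R for that word.
online_update(diff_store, "tokyo", mu_r=[3.0, 1.0], mu_cn=[1.0, 2.0])
```

Entries for words that were not recognized stay as they were, so the stored difference models converge gradually as more utterances are processed.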

The speech recognition apparatus of the present embodiment,
as described above, performs the speech recognition using the
representative acoustic model C and the difference model D stored
beforehand in the representative acoustic model storing unit 1a
and the difference model storing unit 1b, respectively, and can
simultaneously generate the renewal difference model D" adapted
to noise and speaker.



The difference model D before renewal is gradually renewed
with increasing accuracy by the speaker adaptive renewal difference
model as the number of speech recognition operations increases. Thus,
the composite acoustic model M generated at the step S302 in Fig. 9
gradually becomes a composite acoustic model adapted to both noise
and speaker.

The recognition rate therefore improves as the speech recognition
apparatus is used more often, because the recognition processing
section 7 performs the speech recognition by comparing the composite
acoustic model M", which incorporates the speaker-adaptation, with
the feature vector series V(n) of the uttered speech.

In the first and second embodiments of the present invention,
the group information may be renewed whenever the difference model
D is renewed by the renewal difference model D".

That is, in the first embodiment, after the completion of the
processing at the step S108 in Fig. 6, both the group information
and the renewal difference model may be renewed so that each
acoustic model belongs to the group of its most similar
representative acoustic model, based on the similarity between the
composite model S", which is composed of the representative acoustic
model C and the renewal difference model D", and the representative
acoustic model C.

The renewal difference model d_{x,y}" is stored in the form of
d^m_{i,j,k}" for the HMM number i, the state number j and the mixture
number k, as mentioned previously.

The cluster to which d^m_{i,j,k}" belongs is stored as the cluster
information B^m_{i,j,k}, as previously mentioned. For example, assume
that the cluster to which d^m_{i,j,k}" belongs is β, that is,
B^m_{i,j,k} = β. Then the representative acoustic model of the
cluster to which d^m_{i,j,k}" belongs is C_β. Therefore, the
composite model S^m_{i,j,k}" is obtained from the composition of
d^m_{i,j,k}" and C_β.

Assume that the most similar representative acoustic model
is not C_β, but C_γ, as the result of the comparison based on the
similarity between S^m_{i,j,k}" and all the representative acoustic
models. In this case, the renewal difference model is replaced by

$$d^{m}_{i,j,k}{}'' = S^{m}_{i,j,k}{}'' - C_{\gamma}$$

The cluster information is also replaced by B^m_{i,j,k} = γ.
The renewed difference information and group information are
stored in the storing unit 1c.
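The regrouping can be sketched on mean vectors as follows. The squared Euclidean distance is used as the similarity measure purely as an assumption, since the specification does not fix one; `regroup`, `reps` and the group labels are hypothetical names.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def regroup(
    d_new: Vector,
    group: str,
    reps: Dict[str, Vector],
) -> Tuple[Vector, str]:
    """Move a renewed difference model to its closest representative.

    d_new is the renewal difference mean, group its current cluster
    (beta), and reps maps every cluster to its representative mean.
    """
    # Composite model S" = d" + C_beta
    s = [d + c for d, c in zip(d_new, reps[group])]

    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Most similar representative C_gamma (smallest distance, assumption)
    best = min(reps, key=lambda g: dist2(s, reps[g]))
    # Replaced difference model: d" = S" - C_gamma
    replaced = [si - ci for si, ci in zip(s, reps[best])]
    return replaced, best


reps = {"beta": [0.0, 0.0], "gamma": [4.0, 4.0]}
replaced, new_group = regroup([3.0, 3.5], "beta", reps)
```

The composite model itself is unchanged by the move; only its decomposition into a representative and a difference is updated, which keeps the difference small.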

By grouping or clustering the composite models S", the
group information B, the representative acoustic model C, and the
renewal difference model D" can also all be renewed. However, the
clustering operation requires an enormous amount of calculation
and is not efficient.

When Jacobian adaptation is employed as the noise adaptation
method, the renewal of the representative acoustic model C requires
even more calculation, for forming the initial composite models.

It is therefore effective to renew only the difference model and
the group information in order to obtain the above-mentioned effect
with a small amount of calculation.

In the second embodiment, after the completion of the
processing at the step S310 in Fig. 9, both the group information
and the renewal difference model may be renewed so that each
acoustic model belongs to the group of its most similar
representative acoustic model, on the basis of the similarity
between the composite model S", which is composed of the
representative acoustic model C and the renewal difference model
D", and the representative acoustic model C.

The renewal difference model d_{x,y}" is stored in the form of
d^m_{i,j,k}" for the HMM number i, the state number j and the mixture
number k, as mentioned previously.

The cluster to which d^m_{i,j,k}" belongs is stored as the cluster
information B^m_{i,j,k}, as previously mentioned. For example, assume
that the cluster to which d^m_{i,j,k}" belongs is β, that is,
B^m_{i,j,k} = β. Then the representative acoustic model of the
cluster to which d^m_{i,j,k}" belongs is C_β. Therefore, the
composite model S^m_{i,j,k}" is obtained from the composition of
d^m_{i,j,k}" and C_β.

Assume that the most similar representative acoustic model
is not C_β, but C_γ, as the result of the comparison based on the
similarity between S^m_{i,j,k}" and all the representative acoustic
models. In this case, the renewal difference model is replaced by

$$d^{m}_{i,j,k}{}'' = S^{m}_{i,j,k}{}'' - C_{\gamma}$$

The cluster information is also replaced by B^m_{i,j,k} = γ.

The renewed difference information and group information are
stored in the storing unit 1c.

By grouping or clustering the composite models S", the
group information B, the representative acoustic model C and the
renewal difference model D" can also all be renewed. However, the
clustering operation requires an enormous amount of calculation
and is not efficient.

When Jacobian adaptation is employed as the noise adaptation
method, the renewal of the representative acoustic model C requires
even more calculation, for forming the initial composite models.

It is therefore effective to renew only the difference model and
the group information in order to obtain the above-mentioned effect
with a small amount of calculation.

As mentioned above, the first and second embodiments enable
the speech recognition rate to be improved, in addition to
reducing the amount of processing for recognition.

In other words, the speech recognition apparatus and the speech
recognition method of the first embodiment generate the renewal
difference models and store them in the storing unit 1 before
performing the speech recognition using the renewal difference
models. That is, a large number of acoustic models is divided into
groups or clusters on the basis of similarity, to obtain the group
information, the representative acoustic model, and the difference
models for each group or cluster. These models and information are
stored for each group in the storing section 1.

Before the processing of real speech recognition, the renewal
difference models adapted to noise and speaker are generated, and
the difference models already stored in the storing section 1 are
renewed with them.

When the difference model in the storing section 1 is replaced
by the renewal difference model, first, the noise adaptive
representative acoustic model of each group is generated by
applying the noise-adaptation to the representative acoustic
model of each group stored in the storing section 1.

Next, each of the composite acoustic models adapted to noise
is generated by composing each noise adaptive representative
acoustic model with each difference model of the same group.

Furthermore, the noise and speaker adaptive acoustic model
is generated by the speaker-adaptation of the noise adaptive
composite acoustic model to the feature vector series of the uttered
speech.

Then, the difference model stored in the storing section 1
is replaced by the renewal difference model, which is generated
from the difference between the noise and speaker adaptive acoustic
model and the noise adaptive representative acoustic model.

When the speech recognition is performed in the first embodiment,
first, during the non-utterance period, the noise adaptive
representative acoustic model is generated by adapting the
representative acoustic model to the environmental noise. Then, the
composite acoustic model adapted to noise and speaker is generated
by composing the noise adaptive representative acoustic model with
the renewal difference model. Lastly, the speech recognition is
performed by comparing the composite acoustic model adapted to noise
and speaker with the feature vector series extracted from the uttered
speech.

As mentioned above, the present embodiment employs the
representative acoustic model, the difference model, and the
renewal difference model, which is generated by adapting the
difference model to noise and speaker. The composite acoustic
model needed for the comparison with the feature vector series
obtained from the uttered speech in the speech recognition operation
is then generated by composing the noise adaptive representative
acoustic model with the renewal difference model. This enables the
composite acoustic model to be generated with a smaller amount of
processing.

More specifically, the processing of noise and speaker
adaptation is not performed for all of the large number of acoustic
models needed for the speech recognition, but only for the
representative acoustic model of each group and its difference
models. The composite acoustic model to be matched with the feature
vector series extracted from the uttered speech can be generated by
composing the representative acoustic model and the difference model
to which the noise and speaker adaptation has been applied, thereby
greatly decreasing the quantity of processing.

The first embodiment may be modified as follows. After the 
generation of the noise and speaker adaptive model, the group to 
which the noise and speaker adaptive model belongs may be changed 
based on the similarity to the noise adaptive representative acoustic 
model. The group information may also be renewed so as to correspond 
to the change of the group, and the renewal difference model may be 
generated from the difference between the noise and speaker adaptive 
model and the noise adaptive representative acoustic model of the 
changed group. In this case, the speech recognition is performed 
by using the composite acoustic model generated by the composition 
of the renewed difference model and the noise adaptive representative 
acoustic model generated by the noise-adaptation of the 
representative acoustic model selected with the renewed group 
information. These renewals of both the group information and the 
difference model enable the speech recognition rate to be improved. 
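
The regrouping in this modification can be illustrated with a minimal Python sketch. It assumes that each model is represented by a single mean vector and that "similarity" is measured by Euclidean distance to the noise adaptive representative model; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def distance(u, v):
    # Assumed similarity measure: Euclidean distance between mean vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def regroup(adapted_model, noisy_representatives):
    """Pick the group whose noise adaptive representative is closest to the
    noise and speaker adaptive model, and return that group together with
    the renewal difference model computed against its representative."""
    new_group = min(
        noisy_representatives,
        key=lambda g: distance(adapted_model, noisy_representatives[g]),
    )
    rep = noisy_representatives[new_group]
    renewal_difference = [a - r for a, r in zip(adapted_model, rep)]
    return new_group, renewal_difference

# Hypothetical values: the adapted model drifted away from group "A".
noisy_representatives = {"A": [0.0, 0.0], "B": [4.0, 4.0]}
adapted_model = [3.0, 3.5]
group, diff = regroup(adapted_model, noisy_representatives)
```

Here the group information for the model would be renewed to the returned group, and the returned vector stored as its renewal difference model.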

According to the speech recognition apparatus and method of 
the second embodiment, a large number of acoustic models are divided 
into groups or clusters on the basis of similarity, to obtain 
the group information, the representative acoustic model, and the 
difference model. These models are stored corresponding to the 
identical group in the storing section 1. The present embodiment 
generates the renewal difference model adapted to noise and speaker 
each time speech recognition is performed during the speech 
recognition processing, and replaces the difference model in the 
storing section 1 by the renewal difference model. 
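
The split of the acoustic models into group information, representative models, and difference models can be pictured with the following sketch. It assumes that each model is a single mean vector, that the group assignment is already given, and that the representative model of a group is the average of its members; the averaging rule and all names are illustrative assumptions rather than the method of the embodiment.

```python
def decompose(models, group_of):
    """models: name -> mean vector; group_of: name -> group id.
    Returns (representatives, differences), where each difference is the
    model minus the representative of its group."""
    members = {}
    for name, g in group_of.items():
        members.setdefault(g, []).append(models[name])

    representatives = {}
    for g, vecs in members.items():
        dim = len(vecs[0])
        # Assumed representative: the average of the group members.
        representatives[g] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    differences = {
        name: [m - r for m, r in zip(models[name], representatives[group_of[name]])]
        for name in models
    }
    return representatives, differences

# Hypothetical models and grouping.
models = {"a": [1.0, 2.0], "i": [3.0, 4.0], "u": [10.0, 10.0]}
group_of = {"a": 0, "i": 0, "u": 1}
reps, diffs = decompose(models, group_of)

# Composing a representative with a difference recovers the original model.
rebuilt_a = [r + d for r, d in zip(reps[group_of["a"]], diffs["a"])]
```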

The speech recognition is performed by comparing the feature 
vector series from the uttered speech with the composite acoustic 
model, which is generated by composition of the noise adaptive 
representative acoustic model and the renewal difference model. 
The effect of speaker-adaptation is improved by renewing the stored 
difference model with the renewal difference model at every repetition 
of speech recognition. 

When replacing the difference model in the storing section 
1 by the renewal difference model, firstly, each of the noise adaptive 
representative acoustic models is generated by noise-adaptation 
of each of the representative acoustic models stored in the storing 
section 1. 

Next, the composite acoustic model adapted to noise is 
generated for every group by composition of the noise adaptive 
representative acoustic model and the difference model. 



Furthermore, the noise and speaker adaptive acoustic model 
is generated by executing the speaker-adaptation to the noise 
adaptive composite acoustic model with the feature vector series 
from the uttered speech. 

Then, the difference model in the storing section 1 is replaced 
by the renewal difference model, which is generated from the 
difference between the noise and speaker adaptive acoustic model 
and the noise adaptive representative acoustic model. 

The old renewal difference model stored in the storing section 
1 is replaced by the newest renewal difference model at every 
repetition of the speech recognition. 

In the speech recognition, during a non-utterance period, 
firstly, the adaptation of the representative acoustic model to 
environmental noise generates the noise adaptive representative 
acoustic model. Then, the composite acoustic model adapted to noise 
and speaker is generated by the composition of the noise adaptive 
representative acoustic model and the renewal difference 
model. Lastly, the speech recognition is performed by comparing 
the composite acoustic model adapted to noise and speaker with the 
feature vector series extracted from the uttered speech to be 
recognized. 
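
The recognition flow just described (compose, then compare with the feature vector series) can be sketched as follows. The sketch assumes a single diagonal Gaussian per model with a fixed, shared variance in place of full HMMs, and it assumes the noise adaptive representative models and renewal difference models are already available as mean vectors. All names and values are hypothetical.

```python
import math

def log_likelihood(features, mean, var):
    """Sum of diagonal-Gaussian log densities over all frames."""
    total = 0.0
    for frame in features:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def recognize(features, noisy_reps, renewal_diffs, group_of, var):
    """Return the name of the composite model with the highest likelihood."""
    best_name, best_score = None, -math.inf
    for name, diff in renewal_diffs.items():
        rep = noisy_reps[group_of[name]]
        # Composition: noise adaptive representative plus renewal difference.
        composite = [r + d for r, d in zip(rep, diff)]
        score = log_likelihood(features, composite, var)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical two-model vocabulary sharing one group.
noisy_reps = {0: [1.0, 1.0]}
renewal_diffs = {"yes": [0.5, 0.5], "no": [-2.0, -2.0]}
group_of = {"yes": 0, "no": 0}
features = [[1.4, 1.6], [1.5, 1.5]]
result = recognize(features, noisy_reps, renewal_diffs, group_of, var=[1.0, 1.0])
```

Only the representative models need to be adapted to the current noise in this sketch; each composite model is then obtained by one vector addition per model.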

As mentioned above, the embodiment employs the representative 
acoustic model, the difference model, and the renewal difference 
model, which is generated by adapting the difference model to noise 
and speaker. Then, the composite acoustic model needed in the speech 
recognition is generated by the composition of the noise adaptive 
representative acoustic model and the renewal difference model 
each time speech recognition is performed. Thus, this embodiment 
enables the composite acoustic model to be generated with smaller 
amounts of processing. 

The second embodiment may also be modified as follows. After 
the generation of the noise and speaker adaptive model, the group 
to which the noise and speaker adaptive model belongs may be changed 
based on the similarity to the noise adaptive representative acoustic 
model. The group information may also be renewed so as to correspond 
to the change of the group, and the renewal difference model may be 
generated from the difference between the noise and speaker adaptive 
model and the noise adaptive representative acoustic model of the 
changed group. In this case, the speech recognition is performed 
by using the composite acoustic model generated by the composition 
of the renewed difference model and the noise adaptive representative 
acoustic model generated by the noise-adaptation of the 
representative acoustic model selected with the renewed group 
information. These renewals of both the group information and the 
difference model enable the speech recognition rate to be improved. 

According to the first and second embodiments, a remarkable 
reduction in the amount of processing for generating the composite 
acoustic model is obtained, as well as an improvement in processing 
speed and in the recognition rate. This is because the noise and 
speaker adaptive composite acoustic model to be compared with the 
feature vector series of the uttered speech is generated by the 
composition of the noise adaptive representative acoustic model and 
the renewal difference model, the latter being obtained by executing 
the speaker-adaptation on the difference model, using the noise 
adaptive representative model, the difference model and the uttered 
speech. 

The present application claims priority from Japanese Patent 
Application No. 2002-271670, the disclosure of which is incorporated 
herein by reference. 

While there has been described what are at present considered 
to be preferred embodiments of the present invention, it will be 
understood that various modifications may be made thereto, and it 
is intended that the appended claims cover all such modifications 
as fall within the true spirit and scope of the invention. 